library(tidyverse)
library(readr)
library(lubridate)
library(stringr)
This is my lab notebook for the data analysis project where I will put to use the processes and tools I have found to do a complete data analysis start to finish.
There will be much trial and error and learning as I go. This notebook will be messy.
A few particularly important goals:
1. Define every step of the way, including any steps needed to complete other steps.
2. Record every function or new idea that I put to use, so I can make flashcards.
3. Practice and formalize my iteration cycle.

My goal-based iteration cycle is currently as follows:
1. Set a goal for the task ahead
   * Define expectations or criteria for success
2. Attempt to complete the goal or perform the task
3. Compare your results to your expectations or your criteria for success
4. (Optional) Use judgment and experience to decide whether the expectations should be altered, or the task should be repeated under different parameters
5. Make adjustments

The 5 Stages of Data Analysis:
1. Setting the Stage (get a better name)
2. EDA
3. Build models
4. Interpret and evaluate models
5. Communicate

Initial thoughts on what needs to be accomplished during this stage:
1. Ask a question
2. Collect data
3. Inspect the data
4. Refine and sharpen the question
5. Clean and tidy the data

It’s very important to note the interconnectedness between 1 and 2 above. In a formal setting, the analyst may be given a question to answer and then be tasked with collecting the data, or both may be given at the same time. In our informal, educational setting, we will choose the data first, then find an appropriate question.
I’m going to quickly run through my iteration cycle for finding a data set.
1. Set a goal for the task ahead
   * Define expectations or criteria for success

The goal is to find a data set from some private-sector company that will lend itself well to questions about various performance factors.
So I found some Kickstarter data. There are 40-something files, about 20 MB each. Let’s start with the first one.
So I need to inspect this dataset a little bit. First I want to know the date range of this file. That requires that I convert all of the dates into a usable format:
Next, let’s explore the date range. I’ll arbitrarily pick the launch dates to study.
Interesting. That seems to go all the way from 2009 until the date the data was scraped.
I guess I’ll have to explore some other files. Perhaps this is a compilation or sample.
So let’s write a couple of functions to import everything, then we’ll join the datasets and convert the dates.
This is the one that worked:
Boom! Got it imported and combined! That pipe thing is kinda cool.
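For reference, a minimal sketch of this kind of import-and-combine step. It is not necessarily the code that was used here: the helper name `import_all()`, the directory layout, and the assumption that every file is a CSV with the same columns are all mine.

```r
library(readr)
library(purrr)
library(dplyr)

# Hypothetical helper: read every CSV in `dir` and stack the rows.
import_all <- function(dir, pattern = "\\.csv$") {
  files <- list.files(dir, pattern = pattern, full.names = TRUE)
  files %>%
    map(read_csv) %>%   # one tibble per file
    bind_rows()         # stack them; columns are matched by name
}

# Tiny demonstration with two throwaway files (made-up data)
demo_dir <- file.path(tempdir(), "import_all_demo")
dir.create(demo_dir, showWarnings = FALSE)
write_csv(tibble(id = 1:2, state = c("live", "failed")),
          file.path(demo_dir, "a.csv"))
write_csv(tibble(id = 3:4, state = c("successful", "failed")),
          file.path(demo_dir, "b.csv"))

combined <- import_all(demo_dir)
dim(combined)
```

If `read_csv()` guesses different types for the same column in different files, `bind_rows()` will stop with an error; passing an explicit `col_types` to `read_csv()` is one way around that.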
head(kickstarter)
n_distinct(kickstarter)
[1] 169832
dim(kickstarter)
[1] 169832 33
str(kickstarter)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 169832 obs. of 33 variables:
$ id : int 946464764 432961380 886034377 1700257262 1291954177 773132137 1034315928 140043775 1822914340 1979668691 ...
$ photo : chr "{\"small\":\"https://ksr-ugc.imgix.net/assets/015/025/634/778323e325e49e1cf47cebc49e420b00_original.jpg?crop=faces&w=160&h=90&f"| __truncated__ "{\"small\":\"https://ksr-ugc.imgix.net/assets/015/085/011/0d25261525708ba220b7fdd31a00f2f4_original.jpg?crop=faces&w=160&h=90&f"| __truncated__ "{\"small\":\"https://ksr-ugc.imgix.net/assets/014/998/522/05acd2018b26a29145de61824c87ea3e_original.jpg?crop=faces&w=160&h=90&f"| __truncated__ "{\"small\":\"https://ksr-ugc.imgix.net/assets/015/134/849/c34df652ded9a0f1bacf900a36020e90_original.jpg?crop=faces&w=160&h=90&f"| __truncated__ ...
$ name : chr "100 Fantasy Portraits for Oria Trail the Game" "The Deer Hunter" "Bespoke Pet Portraits – Michael Gardner Design" "Ancient Ones: Knight of Jupiter" ...
$ blurb : chr "Get a fantasy portrait drawn of yourself, a loved one, a D&D or original character and have it added into a video game." "A series of digital paintings by Bryn G Jones" "My campaign is focused on creating bespoke pet portraits. I transform photos of loved pets into unique geometric portraits." "An introduction into the world of Ancient Ones, through the first Limited Edition print of the series, \"Knight of Jupiter\"." ...
$ goal : num 100 25 500 200 1500 100 400 10000 3000 200 ...
$ pledged : num 478 294 501 893 1847 ...
$ state : chr "successful" "successful" "successful" "successful" ...
$ slug : chr "100-fantasy-portraits-for-oria-trail-the-game" "the-deer-hunter" "bespoke-pet-portraits-michael-gardner-design" "ancient-ones-knight-of-jupiter" ...
$ disable_communication : chr "false" "false" "false" "false" ...
$ country : chr "US" "GB" "GB" "US" ...
$ currency : chr "USD" "GBP" "GBP" "USD" ...
$ currency_symbol : chr "$" "£" "£" "$" ...
$ currency_trailing_code: chr "true" "false" "false" "true" ...
$ deadline : int 1486194840 1485990000 1487101859 1487367219 1488040724 1488236400 1488405818 1488648406 1489684237 1491041571 ...
$ state_changed_at : int 1486194840 1485990001 1487101859 1487367221 1488040724 1488236400 1488405818 1488648408 1489684237 1491041571 ...
$ created_at : int 1483170863 1483733879 1482876950 1455213005 1470008219 1485572871 1479854987 1484917222 1486808780 1486396504 ...
$ launched_at : int 1483736396 1484479361 1484509859 1484775219 1485448724 1485617773 1485813818 1486056406 1487095837 1487157171 ...
$ staff_pick : chr "false" "true" "false" "false" ...
$ is_starrable : chr "false" "false" "false" "false" ...
$ backers_count : int 28 16 7 29 42 8 10 64 3 35 ...
$ static_usd_rate : num 1 1.22 1.22 1 1 ...
$ usd_pledged : num 478 358 610 893 1847 ...
$ creator : chr "{\"urls\":{\"web\":{\"user\":\"https://www.kickstarter.com/profile/ithaqualabs\"},\"api\":{\"user\":\"https://api.kickstarter.c"| __truncated__ "{\"urls\":{\"web\":{\"user\":\"https://www.kickstarter.com/profile/1655679628\"},\"api\":{\"user\":\"https://api.kickstarter.co"| __truncated__ "{\"urls\":{\"web\":{\"user\":\"https://www.kickstarter.com/profile/346706187\"},\"api\":{\"user\":\"https://api.kickstarter.com"| __truncated__ "{\"urls\":{\"web\":{\"user\":\"https://www.kickstarter.com/profile/deadlastmedia\"},\"api\":{\"user\":\"https://api.kickstarter"| __truncated__ ...
$ location : chr "{\"country\":\"US\",\"urls\":{\"web\":{\"discover\":\"https://www.kickstarter.com/discover/places/atlanta-ga\",\"location\":\"h"| __truncated__ "{\"country\":\"GB\",\"urls\":{\"web\":{\"discover\":\"https://www.kickstarter.com/discover/places/london-gb\",\"location\":\"ht"| __truncated__ "{\"country\":\"GB\",\"urls\":{\"web\":{\"discover\":\"https://www.kickstarter.com/discover/places/london-gb\",\"location\":\"ht"| __truncated__ "{\"country\":\"US\",\"urls\":{\"web\":{\"discover\":\"https://www.kickstarter.com/discover/places/chicago-il\",\"location\":\"h"| __truncated__ ...
$ category : chr "{\"urls\":{\"web\":{\"discover\":\"http://www.kickstarter.com/discover/categories/art/digital%20art\"}},\"color\":16760235,\"pa"| __truncated__ "{\"urls\":{\"web\":{\"discover\":\"http://www.kickstarter.com/discover/categories/art/digital%20art\"}},\"color\":16760235,\"pa"| __truncated__ "{\"urls\":{\"web\":{\"discover\":\"http://www.kickstarter.com/discover/categories/art/digital%20art\"}},\"color\":16760235,\"pa"| __truncated__ "{\"urls\":{\"web\":{\"discover\":\"http://www.kickstarter.com/discover/categories/art/digital%20art\"}},\"color\":16760235,\"pa"| __truncated__ ...
$ profile : chr "{\"background_image_opacity\":0.8,\"should_show_feature_image_section\":true,\"link_text_color\":null,\"state_changed_at\":1483"| __truncated__ "{\"background_image_opacity\":0.8,\"should_show_feature_image_section\":true,\"link_text_color\":null,\"state_changed_at\":1483"| __truncated__ "{\"background_image_opacity\":0.8,\"should_show_feature_image_section\":true,\"link_text_color\":null,\"state_changed_at\":1482"| __truncated__ "{\"background_image_opacity\":0.8,\"should_show_feature_image_section\":true,\"link_text_color\":\"\",\"state_changed_at\":1488"| __truncated__ ...
$ spotlight : chr "true" "true" "true" "true" ...
$ urls : chr "{\"web\":{\"project\":\"https://www.kickstarter.com/projects/ithaqualabs/100-fantasy-portraits-for-oria-trail-the-game?ref=cate"| __truncated__ "{\"web\":{\"project\":\"https://www.kickstarter.com/projects/1655679628/the-deer-hunter?ref=category_newest\",\"rewards\":\"htt"| __truncated__ "{\"web\":{\"project\":\"https://www.kickstarter.com/projects/346706187/bespoke-pet-portraits-michael-gardner-design?ref=categor"| __truncated__ "{\"web\":{\"project\":\"https://www.kickstarter.com/projects/deadlastmedia/ancient-ones-knight-of-jupiter?ref=category_newest\""| __truncated__ ...
$ source_url : chr "https://www.kickstarter.com/discover/categories/art/digital%20art?ref=category_modal&sort=magic" "https://www.kickstarter.com/discover/categories/art/digital%20art?ref=category_modal&sort=magic" "https://www.kickstarter.com/discover/categories/art/digital%20art?ref=category_modal&sort=magic" "https://www.kickstarter.com/discover/categories/art/digital%20art?ref=category_modal&sort=magic" ...
$ friends : chr NA NA NA NA ...
$ is_starred : chr NA NA NA NA ...
$ is_backing : chr NA NA NA NA ...
$ permissions : chr NA NA NA NA ...
colnames(kickstarter)
[1] "id" "photo" "name"
[4] "blurb" "goal" "pledged"
[7] "state" "slug" "disable_communication"
[10] "country" "currency" "currency_symbol"
[13] "currency_trailing_code" "deadline" "state_changed_at"
[16] "created_at" "launched_at" "staff_pick"
[19] "is_starrable" "backers_count" "static_usd_rate"
[22] "usd_pledged" "creator" "location"
[25] "category" "profile" "spotlight"
[28] "urls" "source_url" "friends"
[31] "is_starred" "is_backing" "permissions"
names(kickstarter)
[1] "id" "photo" "name"
[4] "blurb" "goal" "pledged"
[7] "state" "slug" "disable_communication"
[10] "country" "currency" "currency_symbol"
[13] "currency_trailing_code" "deadline" "state_changed_at"
[16] "created_at" "launched_at" "staff_pick"
[19] "is_starrable" "backers_count" "static_usd_rate"
[22] "usd_pledged" "creator" "location"
[25] "category" "profile" "spotlight"
[28] "urls" "source_url" "friends"
[31] "is_starred" "is_backing" "permissions"
summary(kickstarter)
id photo name blurb
Min. :1.852e+04 Length:169832 Length:169832 Length:169832
1st Qu.:5.396e+08 Class :character Class :character Class :character
Median :1.079e+09 Mode :character Mode :character Mode :character
Mean :1.076e+09
3rd Qu.:1.610e+09
Max. :2.147e+09
goal pledged state slug
Min. : 0 Min. : 0 Length:169832 Length:169832
1st Qu.: 2000 1st Qu.: 40 Class :character Class :character
Median : 5000 Median : 900 Mode :character Mode :character
Mean : 51883 Mean : 10566
3rd Qu.: 15000 3rd Qu.: 5100
Max. :100000000 Max. :13285226
NA's :38
disable_communication country currency currency_symbol
Length:169832 Length:169832 Length:169832 Length:169832
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
currency_trailing_code deadline state_changed_at created_at
Length:169832 Min. :1.241e+09 Min. :1.241e+09 Min. :1.240e+09
Class :character 1st Qu.:1.377e+09 1st Qu.:1.377e+09 1st Qu.:1.368e+09
Mode :character Median :1.424e+09 Median :1.424e+09 Median :1.417e+09
Mean :1.415e+09 Mean :1.415e+09 Mean :1.408e+09
3rd Qu.:1.457e+09 3rd Qu.:1.457e+09 3rd Qu.:1.451e+09
Max. :1.508e+09 Max. :1.503e+09 Max. :1.503e+09
launched_at staff_pick is_starrable backers_count
Min. :1.241e+09 Length:169832 Length:169832 Min. : 0.0
1st Qu.:1.374e+09 Class :character Class :character 1st Qu.: 2.0
Median :1.422e+09 Mode :character Mode :character Median : 16.0
Mean :1.412e+09 Mean : 122.9
3rd Qu.:1.454e+09 3rd Qu.: 68.0
Max. :1.503e+09 Max. :105857.0
static_usd_rate usd_pledged creator location
Min. :0.04564 Min. : 0 Length:169832 Length:169832
1st Qu.:1.00000 1st Qu.: 39 Class :character Class :character
Median :1.00000 Median : 900 Mode :character Mode :character
Mean :1.02256 Mean : 10127
3rd Qu.:1.00000 3rd Qu.: 5073
Max. :1.71641 Max. :13285226
category profile spotlight urls
Length:169832 Length:169832 Length:169832 Length:169832
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
source_url friends is_starred is_backing
Length:169832 Length:169832 Length:169832 Length:169832
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
permissions
Length:169832
Class :character
Mode :character
summary(kickstarter[14:17])
deadline state_changed_at created_at launched_at
Min. :1.241e+09 Min. :1.241e+09 Min. :1.240e+09 Min. :1.241e+09
1st Qu.:1.377e+09 1st Qu.:1.377e+09 1st Qu.:1.368e+09 1st Qu.:1.374e+09
Median :1.424e+09 Median :1.424e+09 Median :1.417e+09 Median :1.422e+09
Mean :1.415e+09 Mean :1.415e+09 Mean :1.408e+09 Mean :1.412e+09
3rd Qu.:1.457e+09 3rd Qu.:1.457e+09 3rd Qu.:1.451e+09 3rd Qu.:1.454e+09
Max. :1.508e+09 Max. :1.503e+09 Max. :1.503e+09 Max. :1.503e+09
format(object.size(kickstarter), units = "Mb")
[1] "701.5 Mb"
object.size(kickstarter)
735593520 bytes
So that leaves us with a short list of useful functions for inspecting the dataset:
* head()
* str()
* summary()
* dim()
* n_distinct()
* colnames() or names()
* object.size() # with format() and units arg
Convert state to a factor
is.factor(kickstarter$state)
[1] FALSE
kickstarter$state <- as.factor(kickstarter$state)
is.factor(kickstarter$state)
[1] TRUE
Convert dates
kickstarter$created_at <- as.POSIXct(as.numeric(kickstarter$created_at), origin = '1970-01-01', tz = 'GMT')
kickstarter$deadline <- as.POSIXct(as.numeric(kickstarter$deadline), origin = '1970-01-01', tz = 'GMT')
kickstarter$state_changed_at <- as.POSIXct(as.numeric(kickstarter$state_changed_at), origin = '1970-01-01', tz = 'GMT')
kickstarter$launched_at <- as.POSIXct(as.numeric(kickstarter$launched_at), origin = '1970-01-01', tz = 'GMT')
head(kickstarter)
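The four conversions above repeat the same call, so they could probably be collapsed into one step. A minimal sketch on a toy tibble (the `toy` data frame is made up, using two timestamps from the `str()` output), assuming a dplyr version that has `across()`:

```r
library(dplyr)

# Toy stand-in for the real data: Unix timestamps stored as integers
toy <- tibble(
  created_at  = c(1483170863L, 1483733879L),
  launched_at = c(1483736396L, 1484479361L)
)

# Apply the same seconds-since-epoch conversion to every listed column
toy <- toy %>%
  mutate(across(c(created_at, launched_at),
                ~ as.POSIXct(.x, origin = "1970-01-01", tz = "GMT")))

toy
```

On the real data the column list would be `c(created_at, deadline, state_changed_at, launched_at)`.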
Parse Dates
Maybe I don’t need to parse? Seems too easy…
# min() and max() do not depend on row order, so no arrange() is needed
kickstarter %>%
  summarise(min(launched_at), max(launched_at))
kickstarter %>%
  summarise(min(created_at), max(created_at))
Normal time between creation and launch?
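One way to start answering that is to compute the gap directly. A rough sketch on a toy tibble (made up from the first few timestamps in the `str()` output); the column name `days_to_launch` is my own:

```r
library(dplyr)

toy <- tibble(
  created_at  = as.POSIXct(c(1483170863, 1483733879, 1482876950),
                           origin = "1970-01-01", tz = "GMT"),
  launched_at = as.POSIXct(c(1483736396, 1484479361, 1484509859),
                           origin = "1970-01-01", tz = "GMT")
)

# Days between creating the project page and launching the campaign
toy <- toy %>%
  mutate(days_to_launch = as.numeric(difftime(launched_at, created_at,
                                              units = "days")))

summary(toy$days_to_launch)
```

Fixing `units = "days"` matters because `difftime()` otherwise picks a unit automatically, which makes results harder to compare.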
Back to my first iteration cycle: finding a dataset.
From earlier: I’m going to quickly run through my iteration cycle for finding a data set.
1. Set a goal for the task ahead
   * Define expectations or criteria for success

The goal is to find a data set from some private-sector company that will lend itself well to questions about various performance factors.
And a reminder of my iteration process. My goal-based iteration cycle is currently as follows:
1. Set a goal for the task ahead
   * Define expectations or criteria for success
2. Attempt to complete the goal or perform the task
3. Compare your results to your expectations or your criteria for success
4. (Optional) Use judgment and experience to decide whether the expectations should be altered, or the task should be repeated under different parameters
5. Make adjustments

Step 2, attempt to complete the goal or perform the task: Done.
Step 3, compare your results to your expectations or your criteria for success: Kickstarter data about potential startups was definitely not what I had in mind (I was thinking more along the lines of a single private-sector company).
Step 4 (optional), use judgment and experience to decide whether the expectations should be altered, or the task should be repeated under different parameters: My interpretation is that the expectations should be altered (and we should stick with the Kickstarter data).
And now let’s work on the question.
1. Set a goal for the task ahead
   * Define expectations or criteria for success

The goal is to have a question that meets several criteria:
1. The answer may provide insight into factors that lead to successful funding on Kickstarter.
2. Can be answered with the current dataset.
3. Provides the opportunity to apply the Process of Data Science.
4. Allows for exploration of the similarities and differences between Inferential and Predictive questions. The question should fall into one of the categories, but could be varied slightly to fall into the other for learning purposes.
5. Be sufficiently sharp to provide useful insight.

Step 2, attempt to complete the goal or perform the task. First pass: be very general. Attempt: What is the factor that best predicts successful funding on Kickstarter?
Criterion #1: The answer may provide insight into factors that lead to successful funding on Kickstarter. -I think I’ve accomplished this one. If we answer the question, we would know about the factors that contribute to success. It feels like its usefulness is limited by only looking for the single best variable. Perhaps we could change it to “factor or combination of factors”.
New question: What factor or combination of factors best predicts successful funding on Kickstarter?
Now we’ll compare it to criterion #2: Can be answered with the current dataset. -This one requires a little research and thought.
names(kickstarter)
[1] "id" "photo" "name"
[4] "blurb" "goal" "pledged"
[7] "state" "slug" "disable_communication"
[10] "country" "currency" "currency_symbol"
[13] "currency_trailing_code" "deadline" "state_changed_at"
[16] "created_at" "launched_at" "staff_pick"
[19] "is_starrable" "backers_count" "static_usd_rate"
[22] "usd_pledged" "creator" "location"
[25] "category" "profile" "spotlight"
[28] "urls" "source_url" "friends"
[31] "is_starred" "is_backing" "permissions"
So what variables have any chance of being useful here?
* id: no
* photo: no
* name: possibly
* blurb: possibly
* goal: possibly
* pledged: possibly
* state: possibly
* slug: possibly
* disable_communication: possibly
* country: possibly
* currency: possibly
* currency_symbol: no
* currency_trailing_code: no
* deadline: possibly
* state_changed_at: possibly
* created_at: possibly
* launched_at: possibly
* staff_pick: possibly
* is_starrable: possibly
* backers_count: possibly
* static_usd_rate: no
* usd_pledged: possibly
* creator: possibly
* location: possibly
* category: possibly
* profile: possibly
* spotlight: possibly
* urls: no
* source_url
* friends: possibly
* is_starred: possibly
* is_backing: possibly
* permissions: possibly

So a lot of those will require more info about what each variable actually contains. We can go look, but right now we just need a rough idea about whether or not we have a chance of answering the question. We can explore what information each of those variables contains in the next stage (EDA).
And on to criterion #3: Provides the opportunity to apply the Process of Data Science. Since we have nearly completed stage 1, let’s walk through the next 4 stages, simply deciding whether or not we are likely to be able to use this question at each stage:
* Stage 2: EDA - yes
* Stage 3: Build a model - yes
* Stage 4: Interpret the results - yes
* Stage 5: Communicate - yes
* Overall: yes

Criterion #4: Allows for exploration of the similarities and differences between Inferential and Predictive questions. The question should fall into one of the categories, but could be varied slightly to fall into the other for learning purposes. -This one requires a few definitions: An inferential data analysis quantifies whether an observed pattern will likely hold beyond the data set in hand.
Going beyond an inferential data analysis, which quantifies the relationships at population scale, a predictive data analysis uses a subset of measurements (the features) to predict another measurement (the outcome) on a single person or unit.
Shit - I just realized something. As it’s phrased, I would be trying to perform classification, rather than regression with a continuous outcome. Hmmmm. Is there a variable I could use as an outcome that is continuous? Candidates: pledged, backers_count, or (mutated) pledged as a percentage of goal. I like pledged as a percentage of goal because it gives degrees of success, instead of just a binary outcome. Otherwise, barely failed and really failed would be treated the same. Similarly, barely failed and barely succeeded would be treated as completely different. OK, I like that. Crisis averted. I could also use the binary version as a means of comparison, and simply to show another type of model.
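A minimal sketch of that mutated outcome, on made-up numbers. The column names `pct_of_goal` and `funded` are my own, and treating a goal of 0 as missing is an assumption (the summary shows a minimum goal of 0 and 38 NAs, so both cases need some handling):

```r
library(dplyr)

toy <- tibble(
  goal    = c(100, 25, 500, 0, NA),
  pledged = c(478, 294, 250, 10, 50)
)

toy <- toy %>%
  mutate(
    # continuous outcome: degree of success rather than a yes/no label
    pct_of_goal = if_else(goal > 0, pledged / goal * 100, NA_real_),
    # binary version, kept for comparison with a classification model
    funded      = pct_of_goal >= 100
  )

toy
```

Note that `funded` here is derived purely from the amounts; it may not agree with the `state` column for canceled or suspended projects.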
So back to inferential vs. predictive: as originally phrased, it is a predictive question. This is more useful than attempting to determine whether or not it would hold at the population scale. We can stick with prediction, but show how to use inferential statistics when necessary.
And finally, criterion #5: Be sufficiently sharp to provide useful insight. As it stands, it is not sharp. However, we can sharpen as necessary. We can put together sharper questions in order to test our models (hypotheses). -I might take this one out.
We could also compare to the AoDS criteria, Characteristics of a Good Question:
1. Should be of interest to your audience
2. Should not have already been answered
3. Should stem from a plausible framework
4. Should be answerable
5. Specificity
But that sounds a little boring…
I think I have sufficiently met my criteria after the one iteration. Almost too easy.
It’s worth discussing how this process continues through (at least) the next two stages. The purpose of EDA is to explore the variables, identify relationships, and generate ideas to test further. We will come up with more specific questions to test later. Specificity will increase to the point of becoming a hypothesis (or pair of hypotheses).
So onto some cleaning and tidying.
Let’s first compare to tidy principles:
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.

We’re actually pretty tidy already. When I go back through, I should find some messy datasets and perform some basic tidying.
Other common data cleaning tasks:
* outlier checking
* date parsing
* missing value imputation

I think I took care of the date parsing, right? When I converted from Unix time, they were already parsed. But let’s get a definition.
So let’s start checking for and playing with missing data. BLAAAAHHHH
kickstarter[!complete.cases(kickstarter),]
kickstarter[complete.cases(kickstarter),]
options(scipen = 999)
colMeans(is.na(kickstarter))
id photo name
0.000000000000 0.000000000000 0.000005888172
blurb goal pledged
0.000017664516 0.000223750530 0.000000000000
state slug disable_communication
0.000000000000 0.000000000000 0.000000000000
country currency currency_symbol
0.000000000000 0.000000000000 0.000000000000
currency_trailing_code deadline state_changed_at
0.000000000000 0.000000000000 0.000000000000
created_at launched_at staff_pick
0.000000000000 0.000000000000 0.000000000000
is_starrable backers_count static_usd_rate
0.000000000000 0.000000000000 0.000000000000
usd_pledged creator location
0.000000000000 0.000000000000 0.004316029959
category profile spotlight
0.000000000000 0.000000000000 0.000000000000
urls source_url friends
0.000000000000 0.000000000000 0.999735032267
is_starred is_backing permissions
0.999735032267 0.999735032267 0.999735032267
head(kickstarter[30:33], 100)
nrow(kickstarter) * mean(is.na(kickstarter$location))
[1] 733
sum(is.na(kickstarter$location))
[1] 733
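The same `colMeans(is.na(...))` result can also be used to pick out columns that are almost entirely missing, which is what friends, is_starred, is_backing, and permissions look like above. A small sketch on made-up data; the 0.99 cutoff is an arbitrary assumption:

```r
# Toy data frame: one column with a single NA, one that is entirely NA
toy <- data.frame(
  id       = 1:4,
  location = c("Atlanta, GA", NA, "London, UK", "Chicago, IL"),
  friends  = c(NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

na_share <- colMeans(is.na(toy))                 # share of NAs per column
mostly_missing <- names(na_share[na_share > 0.99])
mostly_missing
```

The resulting vector of names could then be handed to `select(-all_of(mostly_missing))` instead of typing the columns out by hand.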
So what other tasks should I introduce as data cleaning?
* Parse dates
* Manipulate strings?
* Convert to factors?
* Mutate?
* Check for outliers? Or wait until EDA?
kickstarter %>%
  summarise(max(pledged))
This should be done visually
ggplot(kickstarter, aes(state, pledged)) +
  geom_boxplot()
kickstarter %>%
  filter(pledged > 7500000)
kickstarter %>%
  arrange(desc(pledged))
kickstarter %>%
  group_by(state) %>%
  summarise(n(), mean(pledged))
We should probably remove live entries since they are not complete.
Since live would indicate a deadline after 2017-08-15, let’s see how many such records there are, and compare them to those with the live state.
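kickstarter %>%
  # ymd() returns a Date by default; passing tz gives a POSIXct, the same class as deadline
  filter(deadline >= ymd("2017-08-15", tz = "GMT")) %>%
  group_by(state) %>%
  summarise(n(), mean(pledged))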
So some may have been suspended or cancelled, or also reached their goal before the data was scraped. But the vast majority are live. We could check that the other way as well, just to be sure we know what we’re working with.
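kickstarter %>%
  # strict < so this is the exact complement of the >= filter; tz makes ymd() return a POSIXct
  filter(deadline < ymd("2017-08-15", tz = "GMT")) %>%
  group_by(state) %>%
  summarise(n(), mean(pledged))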
Confirmed my suspicion that live records are just ones that haven’t met their deadline as of the time they were scraped. We should remove them.
169832-3573
[1] 166259
kickstarter <- kickstarter %>%
  filter(state != "live")
nrow(kickstarter)
[1] 166259
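Because state was converted to a factor earlier, filtering out the live rows leaves "live" behind as an empty level (it shows up as `live : 0` in `summary()`). A small sketch on a toy factor of how `droplevels()` would clear it; whether to do this on the real data is a judgment call:

```r
# Toy factor standing in for kickstarter$state
state <- factor(c("live", "failed", "successful", "canceled", "live"))

kept <- state[state != "live"]
levels(kept)               # "live" is still listed, with zero rows

kept <- droplevels(kept)   # drop levels that no longer occur
levels(kept)
```

On the data frame itself, `kickstarter$state <- droplevels(kickstarter$state)` would be the equivalent step.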
Is the data tidy? Hadley’s Definition of Tidy Data:
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.

Yes, we are pretty much tidy. When I come back through here, I need to come up with (randomly construct) an untidy dataset, and perform some common operations on it.
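As a placeholder for that exercise, here is a small sketch of one common untidy shape: values of a variable (year) spread across column headers. The numbers are invented, and `pivot_longer()` assumes tidyr 1.0 or later.

```r
library(tidyr)
library(dplyr)

# Untidy: the year variable lives in the column names (made-up counts)
untidy <- tibble(
  category = c("Art", "Games"),
  `2015`   = c(10, 20),
  `2016`   = c(12, 25)
)

# Tidy: one row per category-year observation, one column per variable
tidy <- untidy %>%
  pivot_longer(cols = -category, names_to = "year", values_to = "projects") %>%
  mutate(year = as.integer(year))

tidy
```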
A little more cleaning I could do is remove some unnecessary columns to make things a little easier to see and work with. Let’s revisit a couple of things.
head(kickstarter)
Some possible columns to remove: photo, friends, is_starred, is_backing, permissions
A few need to be made into factors: disable_communication, country, currency, currency_symbol, currency_trailing_code, staff_pick, is_starrable, spotlight
Let’s make the factors real quick
kickstarter$state <- as.factor(kickstarter$state)
kickstarter$disable_communication <- as.factor(kickstarter$disable_communication)
kickstarter$country <- as.factor(kickstarter$country)
kickstarter$currency <- as.factor(kickstarter$currency)
kickstarter$currency_symbol <- as.factor(kickstarter$currency_symbol)
kickstarter$currency_trailing_code <- as.factor(kickstarter$currency_trailing_code)
kickstarter$staff_pick <- as.factor(kickstarter$staff_pick)
kickstarter$is_starrable <- as.factor(kickstarter$is_starrable)
kickstarter$spotlight <- as.factor(kickstarter$spotlight)
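The nine assignments above all do the same thing, so they could probably be written as a single `mutate()`. A minimal sketch on a toy tibble (made-up rows; `factor_cols` is a name I introduced), assuming a dplyr version with `across()`:

```r
library(dplyr)

toy <- tibble(
  name       = c("a", "b", "c"),
  country    = c("US", "GB", "US"),
  staff_pick = c("false", "true", "false")
)

# Columns that should become factors; everything else is left alone
factor_cols <- c("country", "staff_pick")

toy <- toy %>%
  mutate(across(all_of(factor_cols), as.factor))

str(toy)
```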
names(kickstarter)
[1] "id" "photo" "name"
[4] "blurb" "goal" "pledged"
[7] "state" "slug" "disable_communication"
[10] "country" "currency" "currency_symbol"
[13] "currency_trailing_code" "deadline" "state_changed_at"
[16] "created_at" "launched_at" "staff_pick"
[19] "is_starrable" "backers_count" "static_usd_rate"
[22] "usd_pledged" "creator" "location"
[25] "category" "profile" "spotlight"
[28] "urls" "source_url" "friends"
[31] "is_starred" "is_backing" "permissions"
summary(kickstarter)
id photo name blurb
Min. : 18520 Length:166259 Length:166259 Length:166259
1st Qu.: 540413859 Class :character Class :character Class :character
Median :1079698068 Mode :character Mode :character Mode :character
Mean :1076278785
3rd Qu.:1609983854
Max. :2147476221
goal pledged state slug
Min. : 0 Min. : 0 canceled :11090 Length:166259
1st Qu.: 2000 1st Qu.: 40 failed :78341 Class :character
Median : 5000 Median : 912 live : 0 Mode :character
Mean : 50583 Mean : 10546 successful:76055
3rd Qu.: 15000 3rd Qu.: 5118 suspended : 773
Max. :100000000 Max. :13285226
NA's :38
disable_communication country currency currency_symbol
false:165486 US :130797 USD :130797 $ :142428
true : 773 GB : 14543 GBP : 14543 £ : 14543
CA : 6612 EUR : 7256 € : 7256
AU : 3518 CAD : 6612 Fr: 317
DE : 1686 AUD : 3518 kr: 1715
NL : 1369 SEK : 810
(Other): 7734 (Other): 2723
currency_trailing_code deadline state_changed_at
false: 22116 Min. :2009-05-03 06:59:59 Min. :2009-05-03 07:00:17
true :144143 1st Qu.:2013-07-26 12:27:48 1st Qu.:2013-07-26 01:14:34
Median :2015-02-01 17:55:25 Median :2015-01-31 01:00:13
Mean :2014-10-10 03:42:29 Mean :2014-10-08 16:03:35
3rd Qu.:2016-02-04 14:05:14 3rd Qu.:2016-02-02 09:39:32
Max. :2017-10-12 19:39:05 Max. :2017-08-16 04:28:24
created_at launched_at staff_pick
Min. :2009-04-21 17:35:35 Min. :2009-04-24 19:52:03 false:147109
1st Qu.:2013-04-23 16:17:08 1st Qu.:2013-06-25 00:41:18 true : 19150
Median :2014-11-05 08:54:38 Median :2014-12-30 19:37:23
Mean :2014-07-27 13:05:53 Mean :2014-09-06 07:05:32
3rd Qu.:2015-11-16 19:29:57 3rd Qu.:2016-01-02 21:44:01
Max. :2017-08-15 17:08:05 Max. :2017-08-15 18:23:44
is_starrable backers_count static_usd_rate usd_pledged
false:166259 Min. : 0.0 Min. :0.04564 Min. : 0
1st Qu.: 2.0 1st Qu.:1.00000 1st Qu.: 40
Median : 17.0 Median :1.00000 Median : 914
Mean : 123.5 Mean :1.02371 Mean : 10166
3rd Qu.: 69.0 3rd Qu.:1.00000 3rd Qu.: 5100
Max. :105857.0 Max. :1.71641 Max. :13285226
creator location category profile
Length:166259 Length:166259 Length:166259 Length:166259
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
spotlight urls source_url friends
false:90204 Length:166259 Length:166259 Length:166259
true :76055 Class :character Class :character Class :character
Mode :character Mode :character Mode :character
is_starred is_backing permissions
Length:166259 Length:166259 Length:166259
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
Some possible columns to remove: photo, friends, is_starred, is_backing, permissions
kickstarter <- kickstarter %>%
  select(-c(photo, friends, is_starred, is_backing, permissions))
dim(kickstarter)
[1] 166259 28
Let’s also remove profile
kickstarter <- kickstarter %>%
  select(-profile)
dim(kickstarter)
[1] 166259 27
head(kickstarter)
https://www.kickstarter.com/discover/categories/art/digital%20art?ref=category_modal&sort=magic
{"web":{"project":"https://www.kickstarter.com/projects/ithaqualabs/100-fantasy-portraits-for-oria-trail-the-game?ref=category_newest","rewards":"https://www.kickstarter.com/projects/ithaqualabs/100-fantasy-portraits-for-oria-trail-the-game/rewards"}}
Remove urls
kickstarter <- kickstarter %>%
  select(-urls)
dim(kickstarter)
[1] 166259 26
head(kickstarter)
And a few that could use some string manipulation: creator, location, category, profile
kickstarter %>%
  select(category)
{"urls":{"web":{"discover":"http://www.kickstarter.com/discover/categories/art/digital%20art"}},"color":16760235,"parent_id":1,"name":"Digital Art","id":21,"position":3,"slug":"art/digital art"}
{"urls":{"web":{"discover":"http://www.kickstarter.com/discover/categories/art/illustration"}},"color":16760235,"parent_id":1,"name":"Illustration","id":22,"position":4,"slug":"art/illustration"}
length(levels(as.factor(kickstarter$category)))
[1] 154
Let’s experiment
extract_category <- function(string) {
  output <- vector("character", length(string))
  start <- '"name":"'   # marker just before the category name
  end <- '"id"'         # key that follows the name
  for (i in seq_along(string)) {
    loc_start <- str_locate(string[[i]], start)
    loc_end <- str_locate(string[[i]], end)
    # skip the 8 characters of the start marker; stop before the `",` that precedes "id"
    output[[i]] <- str_sub(string[[i]],
                           (loc_start[1,1] + 8),
                           (loc_end[1,1] - 3)
    )
  }
  output
}
kickstarter$category <- extract_category(kickstarter$category)
Extract creator
{"urls":{"web":{"user":"https://www.kickstarter.com/profile/ithaqualabs"},"api":{"user":"https://api.kickstarter.com/v1/users/246410818?signature=1502949105.be71a5c8169617f9faeec39b384b715b0dcbbdcd"}},"is_registered":true,"name":"Christopher \"Michael\" Hall","id":246410818,"avatar":{"small":"https://ksr-ugc.imgix.net/assets/007/018/494/e96cc66ca639dfc9055df48505952104_original.jpg?w=160&h=160&fit=crop&v=1461428617&auto=format&q=92&s=947233770a463f383a6e02ce9de927ab","thumb":"https://ksr-ugc.imgix.net/assets/007/018/494/e96cc66ca639dfc9055df48505952104_original.jpg?w=40&h=40&fit=crop&v=1461428617&auto=format&q=92&s=2235a24568ec8aadb35373184933a347","medium":"https://ksr-ugc.imgix.net/assets/007/018/494/e96cc66ca639dfc9055df48505952104_original.jpg?w=160&h=160&fit=crop&v=1461428617&auto=format&q=92&s=947233770a463f383a6e02ce9de927ab"},"slug":"ithaqualabs"}
extract_creator <- function(string) {
  output <- vector("character", length(string))
  start <- '"name":"'   # marker just before the creator name
  end <- '"id"'         # key that follows the name
  for (i in seq_along(string)) {
    loc_start <- str_locate(string[[i]], start)
    loc_end <- str_locate(string[[i]], end)
    # skip the 8 characters of the start marker; stop before the `",` that precedes "id"
    output[[i]] <- str_sub(string[[i]],
                           (loc_start[1,1] + 8),
                           (loc_end[1,1] - 3)
    )
  }
  output
}
kickstarter$creator <- extract_creator(kickstarter$creator)
kickstarter %>%
  select(creator)
Extract location {“country”:“US”,“urls”:{“web”:{“discover”:“https://www.kickstarter.com/discover/places/atlanta-ga”,“location”:“https://www.kickstarter.com/locations/atlanta-ga”},“api”:{“nearby_projects”:“https://api.kickstarter.com/v1/discover?signature=1502924597.cb3fbf5fbe9154ddf04d85e7be879e4403e3fd98&woe_id=2357024”}},“name”:“Atlanta”,“displayable_name”:“Atlanta, GA”,“short_name”:“Atlanta, GA”,“id”:2357024,“state”:“GA”,“type”:“Town”,“is_root”:false,“slug”:“atlanta-ga”}
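extract_location <- function(string) {
  output <- vector("character", length(string))
  start <- '"displayable_name":"'  # marker just before the location (20 characters)
  end <- '"short_name"'            # key that follows the displayable name
  for (i in seq_along(string)) {
    loc_start <- str_locate(string[[i]], start)
    loc_end <- str_locate(string[[i]], end)
    # Skip past the 20-character start marker; stop before the `",` that precedes "short_name"
    output[[i]] <- str_sub(string[[i]],
                           (loc_start[1,1] + 20),
                           (loc_end[1,1] - 3)
    )
  }
  output
}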
kickstarter$location <- extract_location(kickstarter$location)
kickstarter %>%
  select(location)
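The extract functions above differ only in their start marker, end marker, and the hard-coded offset. A minimal sketch of one shared helper, relying on the fact that `str_locate()` and `str_sub()` are vectorized, so no loop is needed. `extract_between` is a hypothetical name, and here the end marker includes the leading `",` so no offset has to be counted by hand:

```r
library(stringr)

# Hypothetical helper: text between the first `start` marker and the first `end` marker.
extract_between <- function(string, start, end) {
  loc_start <- str_locate(string, fixed(start))  # one row per input string
  loc_end <- str_locate(string, fixed(end))
  # Begin one character after the start marker; stop one character before the end marker.
  str_sub(string, loc_start[, "end"] + 1, loc_end[, "start"] - 1)
}

# Shortened stand-in for one location value (not the full record).
sample_location <- '{"name":"Atlanta","displayable_name":"Atlanta, GA","short_name":"Atlanta, GA","id":2357024}'
locations <- extract_between(c(sample_location, NA),
                             '"displayable_name":"', '","short_name"')
locations
```

Missing inputs come back as NA, which matches how the NA's show up in the location column later on.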
Extract profile - Nevermind, get rid of it!
{"background_image_opacity":0.8,"should_show_feature_image_section":true,"link_text_color":null,"state_changed_at":1483170864,"blurb":null,"background_color":null,"project_id":2816416,"name":null,"feature_image_attributes":{"image_urls":{"default":"https://ksr-ugc.imgix.net/assets/015/025/634/778323e325e49e1cf47cebc49e420b00_original.jpg?crop=faces&w=1552&h=873&fit=crop&v=1483175120&auto=format&q=92&s=9dfd6c8242477cf82c8e8c03d0801a45","baseball_card":"https://ksr-ugc.imgix.net/assets/015/025/634/778323e325e49e1cf47cebc49e420b00_original.jpg?crop=faces&w=560&h=315&fit=crop&v=1483175120&auto=format&q=92&s=28ca75d75ac91e75cbbdcddfa2c6042e"}},"link_url":null,"show_feature_image":false,"id":2816416,"state":"inactive","text_color":null,"link_text":null,"link_background_color":null}
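Getting rid of a column like this can be done with a negative `select()`. A minimal sketch on a toy data frame, assuming the column is named `profile`; `kickstarter_toy` is a made-up stand-in for the real `kickstarter` data:

```r
library(dplyr)

# Toy stand-in for the real data frame (column names are assumptions).
kickstarter_toy <- tibble(
  id = c("1", "2"),
  profile = c('{"state":"inactive"}', '{"state":"active"}')
)

# Drop the column; every other column is kept as-is.
kickstarter_toy <- kickstarter_toy %>%
  select(-profile)
names(kickstarter_toy)
```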
And now convert a couple of those recently extracted character strings to factors, and store id as a character string rather than a number.
kickstarter$category <- as.factor(kickstarter$category)
kickstarter$location <- as.factor(kickstarter$location)
kickstarter$id <- as.character(kickstarter$id)
summary(kickstarter)
      id                name              blurb                goal
 Length:166259      Length:166259      Length:166259      Min.   :        0
 Class :character   Class :character   Class :character   1st Qu.:     2000
 Mode  :character   Mode  :character   Mode  :character   Median :     5000
                                                          Mean   :    50583
                                                          3rd Qu.:    15000
                                                          Max.   :100000000
                                                          NA's   :38
    pledged                state            slug            disable_communication
 Min.   :       0   canceled  :11090   Length:166259      false:165486
 1st Qu.:      40   failed    :78341   Class :character   true :   773
 Median :     912   live      :    0   Mode  :character
 Mean   :   10546   successful:76055
 3rd Qu.:    5118   suspended :  773
 Max.   :13285226
    country          currency      currency_symbol   currency_trailing_code
 US     :130797   USD    :130797   $ :142428         false: 22116
 GB     : 14543   GBP    : 14543   £ : 14543         true :144143
 CA     :  6612   EUR    :  7256   € :  7256
 AU     :  3518   CAD    :  6612   Fr:   317
 DE     :  1686   AUD    :  3518   kr:  1715
 NL     :  1369   SEK    :   810
 (Other):  7734   (Other):  2723
    deadline                   state_changed_at
 Min.   :2009-05-03 06:59:59   Min.   :2009-05-03 07:00:17
 1st Qu.:2013-07-26 12:27:48   1st Qu.:2013-07-26 01:14:34
 Median :2015-02-01 17:55:25   Median :2015-01-31 01:00:13
 Mean   :2014-10-10 03:42:29   Mean   :2014-10-08 16:03:35
 3rd Qu.:2016-02-04 14:05:14   3rd Qu.:2016-02-02 09:39:32
 Max.   :2017-10-12 19:39:05   Max.   :2017-08-16 04:28:24
   created_at                  launched_at                   staff_pick
 Min.   :2009-04-21 17:35:35   Min.   :2009-04-24 19:52:03   false:147109
 1st Qu.:2013-04-23 16:17:08   1st Qu.:2013-06-25 00:41:18   true : 19150
 Median :2014-11-05 08:54:38   Median :2014-12-30 19:37:23
 Mean   :2014-07-27 13:05:53   Mean   :2014-09-06 07:05:32
 3rd Qu.:2015-11-16 19:29:57   3rd Qu.:2016-01-02 21:44:01
 Max.   :2017-08-15 17:08:05   Max.   :2017-08-15 18:23:44
 is_starrable   backers_count       static_usd_rate    usd_pledged
 false:166259   Min.   :     0.0   Min.   :0.04564   Min.   :       0
                1st Qu.:     2.0   1st Qu.:1.00000   1st Qu.:      40
                Median :    17.0   Median :1.00000   Median :     914
                Mean   :   123.5   Mean   :1.02371   Mean   :   10166
                3rd Qu.:    69.0   3rd Qu.:1.00000   3rd Qu.:    5100
                Max.   :105857.0   Max.   :1.71641   Max.   :13285226
   creator                       location                category        spotlight
 Length:166259      Los Angeles, CA  :  8705   Web           :  4106   false:90204
 Class :character   New York, NY     :  6679   Narrative Film:  2987   true :76055
 Mode  :character   London, UK       :  4872   Public Art    :  2981
                    Chicago, IL      :  3353   Indie Rock    :  2975
                    San Francisco, CA:  3044   Painting      :  2969
                    (Other)          :138873   Pop           :  2969
                    NA's             :   733   (Other)       :147272
  source_url
 Length:166259
 Class :character
 Mode  :character
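The summary lists `live` with a count of 0 under state, which suggests the live rows were filtered out while the factor level stayed behind. A minimal sketch of clearing out unused levels with base R's `droplevels()`, using a made-up factor rather than the real `kickstarter$state`:

```r
# Made-up factor that mimics a state column after some rows were removed.
state <- factor(
  c("failed", "successful", "canceled", "failed"),
  levels = c("canceled", "failed", "live", "successful", "suspended")
)
table(state)  # "live" and "suspended" still appear, each with a count of 0

# Keep only the levels that actually occur in the data.
state <- droplevels(state)
levels(state)
```

On the real data this would be something like `kickstarter$state <- droplevels(kickstarter$state)`, assuming an empty `live` level is not wanted in later tables and plots.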
So I want to run back through what I did in Stage 1, which I feel I have now completed. A couple of levels to focus on:
Top level: I think I stuck with the basic activities that I started with, but I will verify that.
Second level: I think they were all pretty straightforward, except for the cleaning part. I want to be sure to outline everything I did for cleaning so that I can generalize a little bit. I also want to confirm that I did in fact complete the other activities, and to see whether any of them had steps worth listing. Finally, would anything have benefited from a more formal walk through the iteration cycle?
To begin, my initial thoughts on what needs to be accomplished in Stage 1:
1. Ask a question
2. Collect data
3. Inspect the data
4. Refine and sharpen the question (not really a distinct step; it is encompassed in the iteration cycle)
5. Clean and Tidy the data
Inspect:
* head
* n_distinct
* dim
* str
* colnames/names
* summary
* object.size
* Plotted some dates
* Some grouped summaries
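For the flashcards, the inspection functions listed above can be run in one pass. A minimal sketch on a made-up three-row tibble (the column names and values are assumptions, not the real Kickstarter data):

```r
library(dplyr)

# Made-up data with one repeated state and one missing goal.
toy <- tibble(
  state = c("failed", "successful", "failed"),
  goal = c(1000, 5000, NA)
)

head(toy, 2)             # first rows
dim(toy)                 # number of rows and columns
names(toy)               # column names
str(toy)                 # type of each column
n_distinct(toy$state)    # number of unique values
summary(toy)             # per-column summary, including NA counts
object.size(toy)         # memory footprint

# A grouped summary: row count and mean goal per state.
by_state <- toy %>%
  group_by(state) %>%
  summarise(n = n(), mean_goal = mean(goal, na.rm = TRUE))
by_state
```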
Clean and Tidy:
* Converted Dates
* Characters to factors
* Check for and deal with missing data
* Explore and deal with outliers
* Remove unnecessary data (generalized from useless columns and live/ongoing cases)
* Manipulate character strings
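To generalize the cleaning steps a little, here is a minimal sketch of several of them chained together on made-up data. It assumes dates arrive as Unix timestamps in seconds, that dplyr 1.0 or later is available for `across()`, and that `toy` and its columns are stand-ins rather than the real data frame:

```r
library(dplyr)
library(lubridate)

# Made-up data: one live project, one missing date, one missing goal.
toy <- tibble(
  state = c("failed", "live", "successful"),
  launched_at = c(1483170864, 1483175120, NA),
  goal = c(5000, NA, 100000000)
)

cleaned <- toy %>%
  mutate(launched_at = as_datetime(launched_at)) %>%  # seconds since 1970 -> POSIXct (UTC)
  filter(state != "live") %>%                         # remove live/ongoing cases
  mutate(state = as.factor(state))                    # character -> factor

# Count the missing values left in each column.
na_counts <- cleaned %>%
  summarise(across(everything(), ~ sum(is.na(.x))))
na_counts
```

What to do with the remaining NA's and the very large goal is a judgment call, so the sketch only reports them.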
So I think that the inspect steps and the clean/tidy step could really benefit from a stated goal at the beginning. This would allow use of the iteration cycle and more clearly define the work to be done.
Goal for Inspect: To develop an idea of the potential usefulness of the dataset and identify problems with the dataset.
Goal for Clean and Tidy: To put the dataset into a useful format and fix problems with the dataset.
I like it. I could revisit these on the next pass.
So should I go with a different notebook for the next stage? Nah